.. _6 steps:
SOEP-IS in 6 steps
*******************************
This section is a brief guide to the basics of working with SOEP-IS. Data users who have never worked with the SOEP before can follow these steps to
gain initial experience with SOEP-IS. Of course, the best way to use SOEP-IS data depends on the research project and individual approaches to working with the data.
The following guide only offers basic tools for using the data quickly and easily. The statistical software used is Stata, but the principles can be applied in any common software for data processing.
*Note: This is not a tutorial for statistical software. Basic knowledge of Stata is assumed.*
1. Data delivery and datasets
========================================
The screenshot below shows the basic data delivery of SOEP-IS (version 2022). The standard data formats are Stata (.dta) and SPSS (.sav), and the datasets are stored with English and German labels.
All datasets of a given format are located in the respective folder, without subdirectories:

.. figure:: png/SUF_data.png
   :align: center

Probably not all datasets are relevant for your research. For a quick overview, we have listed a short description and brief information on each dataset:

.. csv-table::
   :file: docs/SOEP_IS_suffiles.csv
   :header-rows: 1
   :class: longtable
   :widths: 1 3 2 2 2

*Please note that there was no SOEP-IS survey in 2021. For simplification, this is not considered in the "Years" column.*
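
As a first step, one of the delivered datasets can be opened and inspected in Stata. The following is a minimal sketch, not an official workflow: the folder path is a placeholder, and because the names of the stored label languages are not given here, the sketch simply lists them.

.. code-block:: stata

   * Hypothetical path: replace with the Stata folder of your data delivery
   global soepis "C:/data/soep-is/stata"

   * Load the person-level generated dataset
   use "${soepis}/pgen.dta", clear

   * Number of observations and variables, then the full variable list
   describe, short
   describe

   * List the label languages stored in the file
   label language

Running ``label language`` followed by one of the listed names switches between the English and German labels.
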
.. _Finding what you need:
2. Finding what you need
========================================
You now have an overview of the various datasets and can search for the variables you need. SOEP-IS contains a large number of variables, and similar information is often stored in several datasets.
For example, income-related data can be found in pl or in pgen. Which variables should be used? While this depends on the individual research project, there are different approaches
to finding out which variables to use, and it may be useful to combine several of them. Some possible ways of finding variables or information
on the variables are described below. Not all of them are always necessary, and each has its own advantages and disadvantages.
**1. Search in the datasets**
Probably the most intuitive way to find variables is to open a dataset and search for keywords in the variable labels. However, this approach can be difficult because it is unclear which datasets you
need and which words are included in the labels, and because similar information might exist in several datasets.
*Advantages:* quick, easy, intuitive
*Disadvantages:* requires knowledge of datasets and labels, provides no information on the related survey question
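
In Stata, a keyword search of this kind can be sketched with ``lookfor``, which searches variable names and variable labels of the dataset in memory. The example assumes that the working directory is the Stata folder of the data delivery. The keywords are only illustrations.

.. code-block:: stata

   * Assumption: the working directory contains the delivered .dta files
   use "pgen.dta", clear

   * List all variables whose name or label contains the keyword
   lookfor income

   * Several keywords can be given; a variable is listed if any of them matches
   lookfor income wage

The search has to be repeated for each dataset of interest, which is one reason why the other approaches described in this section can be a useful complement.
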
**2. Search in the questionnaires**
There is a complete overview of the questionnaires of SOEP-IS on `paneldata.org/soep-is/instruments/ `_. You can access the English and German metadata-based questionnaires
(see column "Attachments" for PDFs), which contain the survey questions, the related variables and the respective datasets they are stored in. This offers the highest degree of transparency regarding what was actually
surveyed in the questionnaire. Since 2022, these questionnaires have been used as a direct programming template and thus represent the fielded questionnaire as closely as possible. For the same reason, they are
also rather technical and not always easy to understand. For example, long filter conditions sometimes extend beyond the end of the page. Another potential problem is that dataset and variable names have
changed over the years, so older questionnaires may contain outdated names. In most cases, however, the renaming was minimal and can be traced relatively easily.
For example, letters were added (l0011 -> lb0011) or suffixes were appended (l0011 -> lb0011_v1).
Despite these drawbacks, the questionnaires make it possible to find related information and variables relatively quickly by searching the documents for keywords.
*Advantages:* detailed information on survey questions over the years, related variables are easy to find
*Disadvantages:* not always up to date in older years, very technical (e.g. filters)
**3. Search in the codebooks**
For each dataset you can `find a codebook `_ with a short overview of each variable. This makes it possible to get a first impression of the data without having
to access it. Where available, the codebooks also include the latest version of the question behind a variable. As some of the values are displayed, you can also search for value labels in the codebooks.
For variables with a large number of values, however, only some are displayed.
*Advantages:* overview of variables without access to data, limited search for value labels
*Disadvantages:* large PDF documents, overall limited information
**4. Search in SOEP-IS-Companion**
In the chapter :ref:`Topics of SOEP-IS` we provide a selection of the main topics of SOEP-IS and the respective variables. You can use this to find some of the most relevant variables in SOEP-IS. However,
this is not a comprehensive list, and SOEP-IS offers many additional variables. The chapter should therefore serve as a starting point or as a quick way to find general information.
*Advantages:* topic related, easy
*Disadvantages:* just a selection of content and variables
**5. Search in paneldata.org**
You can use `paneldata.org `_ to look for all kinds of information. Paneldata allows you to search for variables and to find more information about generated variables.
It offers comprehensive frequency counts, chronologies of variables, cross-study variable linkage via concepts, a syntax generator, and a topic list for content search in the SOEP.
*Advantages:* easy and comprehensive search, cross-study
*Disadvantages:* partially missing information/links, not all features are available for all data
3. Merging datasets
========================================
Once you know which variables you want to use, you may find that they are stored in different datasets. Most SOEP-IS data can be merged using the identifiers *pid*, *hid*, and *syear*.
The following is a brief overview of these identifiers:
**pid**: Never Changing Person ID - Each individual in SOEP has an unambiguous and never changing pid. It is constant over the years and can be used across datasets and studies.
**hid**: Current Wave HH Number - The hid identifies the household of the respondents in a wave. The hid can change from year to year, e.g. when people leave or switch households.
However, each person only has one hid per year. The hid can be used across studies.
**syear**: The syear variable can be found in every dataset in the long format and indicates the survey year.
SOEP-Core already has an extensive guide on `the identifiers in SOEP data `_
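
Before merging, it can help to check how the identifiers behave in a given dataset. The sketch below is an illustration under two assumptions: the working directory is the Stata folder of the data delivery, and *ppathl* contains *pid*, *hid*, and *syear*.

.. code-block:: stata

   * Assumption: the working directory contains the delivered .dta files
   use pid hid syear using "ppathl.dta", clear

   * Report whether each combination of pid and syear occurs only once
   duplicates report pid syear

   * Report how many person rows share the same household and year
   duplicates report hid syear

If the first report shows no surplus observations, *pid* and *syear* together identify each row and the dataset can be used in a one-to-one merge on these two variables.
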
**General guidelines**
- Merges with long datasets should usually include the *syear* variable.
- Data at the household level should usually be merged using *hid*.
- Data at the individual level should usually be merged using *pid*.
SOEP-Core already has an extensive guide on `how to merge SOEP data with Stata `_.
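
The guidelines above can be sketched in Stata as follows. This is a minimal example, not the procedure from the linked guide. It assumes that the working directory is the Stata folder of the data delivery, that *ppathl* and *pgen* have one row per *pid* and *syear*, and that *hgen* has one row per *hid* and *syear*.

.. code-block:: stata

   * Start from the person-level file with all three identifiers
   use pid hid syear using "ppathl.dta", clear

   * Individual-level data: match on pid and syear
   merge 1:1 pid syear using "pgen.dta", keep(master match) nogenerate

   * Household-level data: several persons can share one hid per year
   merge m:1 hid syear using "hgen.dta", keep(master match) nogenerate

Stata stops with an error if the stated merge type (``1:1`` or ``m:1``) does not fit the data, which serves as a check on the assumptions above. Dropping ``nogenerate`` keeps the ``_merge`` variable for inspecting unmatched rows.
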
4. Understanding the Data
========================================
Once you have all the variables you need, it is sometimes necessary to understand the origin and characteristics of the variables or distributions. Here are some tips that may help you understand the data:
- The options described in :ref:`Finding what you need` can be used to find more information
- In the FAQ chapter :ref:`FAQ Versions` you can find information about the meaning of variable names
- SOEP-Core provides a description of the `missing conventions `_
- It might be useful to merge the *sample1* variable and check whether specific items were asked only in certain samples
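
The last tip can be sketched in Stata as shown below. Several parts are assumptions for illustration only: ``myvar`` is a placeholder for the item of interest in *pl*, *sample1* is assumed to be stored in *ppathl*, both files are assumed to have one row per *pid* and *syear*, and missing codes are assumed to be negative values as described in the missing conventions.

.. code-block:: stata

   * Placeholder: replace myvar with the variable you want to check
   use pid syear myvar using "pl.dta", clear

   * Assumption: sample1 is stored in ppathl
   merge 1:1 pid syear using "ppathl.dta", keepusing(sample1) keep(master match) nogenerate

   * Flag valid answers (assumes missing codes are negative values)
   generate byte answered = myvar >= 0 & !missing(myvar)

   * Samples with no valid answers were probably not asked this item
   tabulate sample1 answered

Adding ``syear`` as a further dimension, for example with ``bysort syear: tabulate sample1 answered``, shows in which survey years the item was fielded.
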
5. Weighting and imputation
========================================
Even though the regular SOEP-IS samples are random probability samples, weights are needed to draw conclusions about the total population.
This is because not all people who are selected in the sampling process actually take part in the survey. The weights compensate for this nonresponse, which would otherwise bias the sample.
SOEP-IS provides weights at the individual and household level. The variables *phrf* and *hhrf* can be found in *ppathl* and *hpathl*, respectively.
SOEP-IS also provides imputations of household income and individual gross and net income. The different imputed variables are stored in *hgen* and *pgen*.
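
A first look at the person weights could be sketched as follows, assuming that the working directory is the Stata folder of the data delivery and that *ppathl* contains *pid*, *syear*, and *phrf*.

.. code-block:: stata

   * Load the identifiers and the person-level weight
   use pid syear phrf using "ppathl.dta", clear

   * Sum of the weights and number of rows per survey year
   tabstat phrf, by(syear) statistics(sum n)

After merging *phrf* to the analysis data via *pid* and *syear*, the weight can be passed to estimation commands, for example ``mean myvar [pweight = phrf]``, where ``myvar`` is a placeholder for the variable of interest.
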
6. Record Linkage and Specific Data Requests
==============================================
SOEP-IS offers the possibility to merge the data with other data sources. Specifically, SOEP-IS data can be linked to administrative data of the
`Institute for Employment Research (IAB) `_ and to
administrative data of the `German Pension Insurance `_.
However, linkage is only possible if the respondents have consented to it. So far, this consent has been requested in SOEP-IS only in 2019 and 2024 (in 2020 only a small subsample received
the consent question).
Some variables are not included in the normal Data Distribution File. If you find items in the questionnaires or the documentation sites that are not part of the official data,
you can send us an e-mail and we may be able to make the data available individually. However, this depends on the respective data and the circumstances.
Our Community Management is also available to answer any other questions you may have. SOEP-IS specific questions may be forwarded to the responsible members of the team.
:ref:`Contact`